Development of clinical phenotypes and biological profiles via proteomic analysis of trauma patients

Background Trauma is a heterogeneous condition, and specific clinical phenotypes may identify target populations that could benefit from certain treatment strategies. In this retrospective study, we determined clinical phenotypes and identified new target populations of trauma patients and their treatment strategies. Methods We retrospectively analyzed datasets from the Japan Trauma Data Bank and determined trauma death clinical phenotypes using statistical machine learning techniques and evaluation of biological profiles. Results The analysis included 71,038 blunt trauma patients [median age, 63 (interquartile range [IQR], 40–78) years; 45,479 (64.0%) males; median Injury Severity Score, 13 (IQR, 9–20)], and the derivation and validation cohorts included 42,780 (60.2%) and 28,258 (39.8%) patients, respectively. Of eight derived phenotypes (D-1–D-8), D-8 (n = 2178) had the highest mortality (48.6%) with characteristic severely disturbed consciousness and was further divided into four phenotypes: D-8α, multiple trauma in the young (n = 464); D-8β, head trauma with lower body temperature (n = 178); D-8γ, severe head injury in the elderly (n = 957); and D-8δ, multiple trauma, with higher predicted mortality than actual mortality (n = 579). Phenotype distributions were comparable in the validation cohort. Biological profile analysis of 90 trauma patients revealed that D-8 exhibited excessive inflammation, including enhanced acute inflammatory response, dysregulated complement activation pathways, and impaired coagulation, including downregulated coagulation and platelet degranulation pathways, compared with other phenotypes. Conclusions We identified clinical phenotypes with high mortality, and the evaluation of the molecular pathogenesis underlying these clinical phenotypes suggests that lethal trauma may involve excessive inflammation and coagulation disorders. Supplementary Information The online version contains supplementary material available at 10.1186/s13054-022-04103-z.

development of new treatments is complicated by the heterogeneity of trauma due to age, sex, comorbidities, injury type and degree, and complex pathophysiology. Therefore, correctly determining the effects of therapeutic interventions remains intractable. Studies using large-scale registry data have attempted to validate interaction effects to identify groups with fatal polytrauma and reported that specific combinations of injuries significantly affect patient outcomes [2]. Thus, in analyses that consider the effects of complex, nonlinear interactions may reveal the pathological conditions that interact with multiple factors.
Recent studies investigated new therapeutic targets by combining unsupervised learning and biological indicators in various diseases to elucidate potential sub-phenotypes [3][4][5][6][7]. Identifying trauma sub-phenotypes with poor outcomes and complex pathologies may enable the discovery of new therapeutic strategies and target populations. Recent advances in technology have allowed researchers to acquire comprehensive biomolecular information. Following trauma, damagerelated molecular patterns bind to pattern recognition receptors expressed on immunocompetent cells, followed by activated intracellular transcription factors binding to nuclear DNA and promoting upregulated transcription and translation of target genes and a systemic inflammatory response. Thus, proteomic analysis of blood might broaden the understanding of the molecular pathology underlying trauma.
In this study, we identified latent clinical phenotypes in trauma patients, with the primary goal of identifying lethal clinical phenotypes with high mortality rates based on available clinical information. The secondary goal was to clarify the molecular pathology of the derived clinical phenotypes by analyzing biological data, including proteomic data.

Overview
This study included two datasets and used several statistical approaches. The study scheme is outlined in Fig. 1. First, we divided the datasets from Japan Trauma Data Bank (JTDB) into two cohorts according to the registration period and identified clinical phenotypes from the derivation cohort using unsupervised clustering. We then derived the clinical phenotypes using the same clustering method in the validation cohort and subsequently analyzed the distribution of each component. We assessed the reproducibility and consistency of the two clinical phenotypes by performing hierarchical clustering analysis. After determining the correlation between the clinical phenotypes and biological markers of host response, we clustered trauma patients from another dataset into different clinical phenotypes and evaluated the molecular pathology of each cluster by analyzing serum proteomic Step

Data collection
We used JTDB data related to all blunt trauma patients. Patients who had non-direct transportation, suffered cardiopulmonary arrest on arrival, had an Injury Severity Score (ISS) of 75, were pregnant, or had significant missing data were excluded from the study. The primary outcome was all-cause mortality. In another dataset, we evaluated the association between clinical phenotypes and biological markers of host response for blunt trauma patients transferred to Osaka University Hospital from 2017 to 2021 (Osaka University cohort). In this cohort, a datasheet containing clinical data necessary (Additional file 1: Methods: Biological Correlates and Clinical Outcomes) for clustering was generated, and biological information was collected. After incorporating the Osaka University cohort into the derivation cohort and determining clusters for each case in the former cohort, biological profiling was performed.

Candidate clinical variables for phenotyping
To generate a model applicable to early trauma care, we restricted the candidate variables to the following two conditions: (1) those related to trauma outcome and pathophysiology and (2) those included as the general information obtained during initial trauma care. Subsequently, 14 candidate variables were selected. We generated a data sheet and included the following information to understand the baseline characteristics of the patients: age, sex, existing comorbidities, vital signs on arrival, AIS code (AIS 90, update 98) [8], ISS [9], Revised Trauma Score [10], and a Trauma and Injury Severity Score and Probability of Survival (TRISS-PS) [11]. The ISS was calculated from the top three scores of the AIS of the six trunks classified by the AIS code.

Statistical analysis
Based on our aims to (1) develop and evaluate new clinical phenotypes related to traumatic death and (2) understand the molecular pathology underlying the derived clinical phenotypes, we first assessed the distribution of the candidate variables and instances where they were absent. The 5-year trauma data were split to obtain a ratio of approximately 6:4 for the sample size used to derive and verify the phenotype (derivation: January 2013-June 2015; validation: July 2015-December 2017) [12]. To avoid multicollinearity, we excluded candidate variables with an absolute correlation coefficient > 0.5 [13]. We derived clinical phenotypes associated with trauma-related death and performed two-step clustering considering calculation cost. We calculated the appropriate number of clusters by the mean silhouette and k-means methods [14,15]. After standardizing patient data, clustering was performed using the k-means algorithm based on the optimal number of clusters, and the silhouette coefficient was calculated according to Euclidean distance. Negative silhouettes were removed, and the survival of each phenotype was evaluated. The clinical phenotype with the highest mortality rate was then extracted and clustered a second time. For the second clustering, a group was derived using latent class analysis (LCA), where the Bayesian information criterion (BIC), appropriate size of each phenotype, and misclassification rate of each phenotype were evaluated to confirm the optimal phenotype number [16,17]. The optimal class number was selected based on the largest BIC considering the misclassification rate and interpretability [Additional file 1: Methods: Latent class analysis (LCA) and Calculation of BIC with LCA] [6]. The proportion of patients assignable to a phenotype at the margin was determined as 45% to 55%. To confirm the robustness of the phenotype, consensus k-means clustering was performed using 14 variables (Additional file 1: Methods: Consensus K clustering) [18]. To determine the optimal number of phenotypes, we evaluated the number of patients included in each phenotype, clear separation of the consensus matrix heatmaps, characteristics of the consensus cumulative distribution function plots, and appropriate pairwise consensus values between the clusters (> 0.8). Additionally, we visualized t-distributed stochastic neighbor embedding (t-SNE) to confirm the reproducibility between LCA and consensus k-means clustering [Additional file 1: Methods: Data visualizing/t-Distributed Stochastic Neighbor Embedding (t-SNE) plot] [19]. Upon determining the clinical phenotype, we generated a complex heatmap, violin plots, and alluvial plots to visualize the distribution of clinical explanatory variables in order to evaluate the features of each clinical phenotype (Additional file 1: Methods: Data visualizing/ Alluvial plot).
Clustering in the validation cohort was performed as described for the derivation cohort. We statistically evaluated the reproducibility as follows: 1) hierarchical clustering by principal component scores of each phenotype and 2) visualization of the centroid of each phenotype with size based on the number of patients by principal component analysis. Additionally, we collected clinical information to cluster patients who were transferred to the Department of Traumatology and Acute Critical Medicine at Osaka University Graduate School of Medicine from February 2017 to March 2021 and incorporated these data into the derivation cohort dataset, with clustering performed as described. The high-mortality group included all cases, whereas the remaining seven clinical phenotypes included ≤ 14 patients close to the centroid in each group.
To evaluate the biological characteristics of each phenotype, we procured general laboratory data of the patients using samples taken upon their arrival to the hospital and performed proteomic analysis of the serum collected within 72 h of injury using mass spectrometry (Additional file 1: Methods: Mass spectrometry). Volcano plot analysis was performed using the limma voom algorithm [20,21] to identify differentially expressed proteins between the high-mortality phenotype and other phenotypes. Differential protein expression was defined as a false discovery rate < 0.2 and a fold change >|1.2|. Subsequently, Gene Ontology (GO) enrichment analysis was performed using the R package clusterProfiler [22]. Patient characteristics data are described as the mean (standard deviation) or median [interquartile range (IQR)]. Mann-Whitney U tests, analysis of variance, and Kruskal-Wallis tests were used to compare continuous data, and the chi-square test was performed for categorical data. The threshold of statistical significance was p < 0.05 according to a two-sided test. No adjustment was made to the type I error rate by multiple comparisons; therefore, these results should be considered exploratory.

Study population
From 2013 to 2017, 158,918 trauma patients were enrolled in JTDB. Of these, 12,565 non-blunt trauma patients and 75,315 patients who did not meet the inclusion criteria were excluded, leaving 71,038 patients for the final analyses (Additional file 1: Fig. S1). Patients admitted from January 2013 to June 2015 were included in the derivation cohort (n = 42,780; 60.2%), and patients admitted from July 2015 to December 2017 were included in the validation cohort (n = 28,258; 39.8%) (Fig. 1). Baseline characteristics of all patients are listed in Table 1. Overall, the median age was 63 years (IQR 40-78 years), median ISS was 13 (IQR 9-20), median TRISS-PS was 0.97 (IQR 0.93-0.99), and the in-hospital mortality rate was 5.5%. The biological profile cohort included 171 patients, with a median age of 50 years (IQR 34-71 years), a median ISS of 17 (IQR 6-26), and median TRISS-PS of 0.97 (IQR 0.84-0.99), and an in-hospital mortality rate of 5.8%. This cohort  was incorporated into the derivation cohort, and clustering analysis was performed using the same method. Additionally, we analyzed biological data for 90 individuals whose coordinates were close to the centroid of each phenotype.

Derivation of clinical trauma phenotypes
We examined correlations among 14 candidate variables and found that there were no variables with absolute correlation coefficients > 0.5 (Additional file 1: Fig. S2). In the derivation cohort, the optimal number of phenotypes was identified as 8 according to the mean silhouette and the k-means method (Additional file 1: Figs. S3 and 4). Additional file 1: Table S1 shows the characteristics of 38,097 patients after removing the negative silhouettes (Additional file 1: Fig. S5). Subsequently, we identified the clinical phenotypes with high mortality rates ( Fig. 1 and Additional file 1: Table S1) and performed LCA on the phenotype with the highest mortality rate (Additional file 1: Fig. S6). The BIC for the LCA model increased continuously with the number of classes, with changes in the BIC decreasing when the number of classes was ≥ 4. Overall, the four-class model was the best-fitting model because of its low misclassification rate and interpretability (Additional file 1: Fig. S7) and showed strong separation in the likelihood of membership for patients assigned to a given phenotype rather than to other phenotypes (Additional file 1: Figs. S8 and 9). The discriminative power of each variable in LCA was subsequently confirmed (Additional file 1: Fig. S10). Additional file 1: Table S2 shows the patient number and baseline characteristics of the four clinical phenotypes derived from the high-mortality cluster in the derivation cohort, with the clinical features of the high-mortality group shown in violin and alluvial plots (Additional file 1: Figs. S11 and 12). Severe head trauma with impaired consciousness and a Glasgow Coma Scale ≤ 8 were common in the highmortality group. The clinical phenotype D-8α is comprised predominantly of young individuals with chest, limb, and pelvic trauma complications (29.2% mortality). Phenotype D-8β mainly included elderly people with head trauma and hypothermia on hospital arrival (46.6% mortality). Clinical phenotype D-8γ had the largest number of patients in the high-mortality group, with the highest mortality rate (56.6%) observed in the elderly with severe head trauma. Phenotype D-8δ was characterized by polytrauma and the highest ISS, with many individuals with this phenotype having chest, extremity, and pelvic trauma. The mortality rate for this phenotype was 51.6%, with the predicted mortality rate considerably higher than the actual mortality rate.

Evaluation of reproducibility
In the validation cohort, clinical phenotypes were derived from silhouette analysis and LCA. The optimal number of phenotypes was 8 when the mean silhouette method and the k-means method were used, as was the case for the derived cohort (Additional file 1: Fig. S13). The LCA for high-mortality phenotypes optimally classified the phenotypes into four phenotypes based on the BIC results. Moreover, consensus k-means clustering for the high-mortality group confirmed its robustness, suggesting that the t-SNE results were comparable with the LCA classification (Additional file 1: Fig. S14). We performed the same clustering for the validation cohort (Additional file 1: Figs. S16-27 and Tables S3 and 4). Figure 2 shows the survival rate distribution and each variable for the phenotypes generated in the derivation and validation cohorts. The high-mortality phenotype (D-8/V-8) was characterized by severely impaired consciousness, lower body temperature, and a higher degree of severe head trauma. Survival analysis revealed that this phenotype (D-8/V-8) showed decreased survival over time (Additional file 1: Figs. S15 and 28). Hierarchical clustering based on principal component scores confirmed that high-mortality clusters were statistically paired (Fig. 3a). We subsequently plotted the principal component coordinates of the calculated centroids according to the numbers of patients and visualized pairs with similar phenotypes in each cohort ( Fig. 3b and Additional file 1: Fig. S29).

Correlation of clinical phenotypes with biomarker profiles
The patient characteristics of the cohort for the biological profile are shown in Additional file 1: Table S5. We removed decoy proteins and immunoglobulins and used 256 proteins according to their standing according to the total exponentially modified protein abundance index summation [24]. Among the differentially expressed proteins, expression levels of 11 and 26 proteins were significantly upregulated and downregulated, respectively ( Fig. 4a and Additional file 1: Tables S6 and 7). Characteristics of patients with high-mortality phenotypes in all three cohorts were compared and are presented in Table 2. GO enrichment analysis showed that clinical phenotype B-8 (equivalent to phenotype D-8/V-8) exhibited excessive inflammation, including enhanced acute inflammatory response and dysregulated complement activation pathways, and impaired coagulation, including downregulated coagulation and platelet degranulation pathways (Fig. 4b). When continuous variables were ranked according to the standardized mean difference between phenotypes (Fig. 5), phenotype B-8 showed higher laboratory values related to coagulation, fibrinolysis, and inflammation and lower values for fibrinogen and platelets relative to the other phenotypes.

Discussion
This analysis revealed 11 tentative, mutually exclusive clinical phenotypes that are multidimensional and exhibit differential patient characteristics along with distinct laboratory data and patterns of organ damage according to the injured region. The frequency and characteristics of the clinical phenotypes were confirmed as robust and reproducible using different cohorts and machine learning methods. Furthermore, biological analysis indicated that the high-mortality phenotype showed excessive inflammation and coagulation dysfunction. The findings suggest that these phenotypes can be evaluated during the initial care of trauma patients to help develop treatment strategies for each respective phenotype and establish inclusion criteria for future clinical trials. Previous studies subgrouped trauma patients according to patient characteristics and injured organs to search for risk factors [25][26][27]; however, no study has comprehensively captured all trauma patients using machine learning approaches to search for new target populations in trauma treatment and evaluate them biologically. Machine learning is advantageous for variable selection and modeling, as it considers complex interaction effects and nonlinearity with outcomes [28,29].
High-mortality groups exhibited a high degree of impaired consciousness and were further divided into a severe traumatic brain injury group and a polytrauma group according to reasonable clinical parameters. In trauma epidemiology, associations between combinations of injured regions and trauma-related mortality have been examined using a logistic model that included interaction effects and demonstrated significant interactions with mortality in head-chest and chest-pelvic/   The number in the cell is the median (sex is shown as a percentage of males). AIS, Abbreviated Injury Scale; BT, body temperature; GCS, Glasgow Coma Scale; SBP, systolic blood pressure; HR, heart rate; RR, respiratory rate; CPS, Charlson Polypharmacy Scale extremities injuries [2]. In the present study, clustering analysis revealed that injuries to the chest, pelvis, and extremities in the polytrauma group were associated with a high mortality rate. Moreover, the high-mortality group showed excessive inflammation, including a dysregulated acute inflammatory response and complement activation pathway, and coagulation abnormalities, which may play a major role in trauma death. This study has two clinical implications. First, we used early trauma care data to identify clinical phenotypes with high mortality, identifying populations that may benefit from early intervention. Selecting sub-phenotypes of patients at high risk of poor outcomes and incorporating them into clinical trials is referred to as prognostic enrichment [30,31]. However, no previous report has identified sub-phenotypes in early trauma stages or attempted new clinical trials, suggesting that the data presented here may reveal new therapeutic target populations and treatment strategies. Additionally, this technique may help clinicians predict potential traumarelated deaths and treat these patients appropriately. Second, we considered phenotypic pathological features from biological data, which enhanced the understanding of the endotypes of trauma patients. Excessive inflammation and coagulation disorders in clinical phenotype D-8/ V-8 suggest the possibility of using preemptive treatment upon confirmation of the phenotype from initial clinical data. These findings may constitute a breakthrough for developing new trauma treatment strategies and therapeutic agents.
This study has a few limitations. First, this is a retrospective study, which has inherent limitations, as unmeasured confounding factors may affect trauma mortality. Second, with AIS coding, only the maximum value is adopted, even when there are multiple injuries in the same region. Third, we only used daily clinical data available from the electronic health records to identify clinical phenotypes. The JTDB datasets used in this study did not include blood test data; therefore, information on characteristics such as lactate levels and blood coagulation could not be used to derive clinical phenotypes. Moreover, future improvements in testing capabilities will likely help reveal biological indicators early in trauma care, which could result in completely different clinical phenotypes from those reported here. Fourth, the studied data were obtained only from Japanese patients. Therefore,

Centroid of each cluster by principal component analysis in all cohorts
High mortality phenotype Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 High mortality phenotype High mortality phenotype  the results may not be generalizable to other cohorts because of differences in trauma patient characteristics and medical practice across countries and regions. Fifth, the small sample size of the Osaka University cohort used for the biological profile may be statistically insufficient.
Comparison of high-mortality phenotypes in each cohort showed significant differences in factors such as heart rate and body temperature. However, demographic data of the high-mortality phenotypes in each cohort were similar, and the significant differences could be attributed to the small sample size of the Osaka University cohort. Sixth, the body undergoes drastic changes after trauma, especially during the acute phase. In this study, the median proteomic sampling time was 1 h (IQR 0.8-1.7 h), with a maximum of 56 h after injury. Thus, the variation in blood sample collection time may have affected the results of proteomic analysis.

Conclusions
In summary, this retrospective analysis using a nationwide trauma cohort for the Japanese population classified all trauma into 11 clinical phenotypes based on available clinical information acquired during the early stage of trauma care. We identified clinical phenotypes with high mortality, with the evaluation of the molecular pathogenesis of the derived clinical phenotypes suggesting that lethal trauma involves excessive inflammation and coagulation disorders.

Author contributions
JT conceived and designed this study; contributed to the acquisition, analysis, and interpretation of the data; and was responsible for drafting, editing, and submitting the manuscript. HM was involved in study conceptualization and significantly to data analysis, interpretation of the results, and manuscript preparation. FS performed protein mass spectrometry analysis and wrote the relevant parts of the manuscript. SS advised on bioinformatics analysis methods and interpretation of the results. DO contributed to the overall implementation of the biological profile and interpretation of the results. Tetsuhisa Kitamura and SK played a significant role in data analysis and helped in drafting the manuscript. YK provided practical advice and support for calculations using supercomputers. Takashi Kojima and YT organized samples and helped with data collection. YK, YN, and HO provided significant input regarding data interpretation and critical appraisal of the manuscript. All authors contributed to data acquisition and reviewed, discussed, and approved the final version of the manuscript. All authors read and approved the final manuscript.

Funding
The authors declare that they have no sources of funding to report.